
fix to_netcdf append bug (GH1215) #1609

Merged
merged 19 commits into from
Oct 25, 2017

Conversation

jhamman
Member

jhamman commented Oct 4, 2017

TODO:
Additional tests are needed to verify that this works for all the write backends and that the correct errors are raised when dims differ between writes.

@fmaussion
Member

Thanks @jhamman !

With this fix, existing variables will silently be ignored and won't be written, right? Maybe the expected behaviour (or specs) of the "append" option should be written somewhere in the docs and tested, to avoid future regressions like the one we had in #1215

@jhamman
Member Author

jhamman commented Oct 10, 2017

@fmaussion - you bring up a good point. There are two scenarios here.

  1. appending to a file with existing data variables
  2. appending to a file with existing coordinate variables

I'm wondering if we should disallow 1, in favor of being more explicit. So:

list_of_vars_to_append = ['var1', 'var2']
ds[list_of_vars_to_append].to_netcdf(filename, mode='a')
# if either var1 or var2 are in filename, raise an error?

as opposed to the current behavior, which would silently skip all vars already in filename and not in list_of_vars_to_append.

@fmaussion
Member

netCDF4 would overwrite in this situation, and I am also in favor of overwriting, as this could be quite a useful use case:

from netCDF4 import Dataset
with Dataset('test.nc', 'w', format='NETCDF4') as nc:
    nc.createDimension('lat', 1)
    nc.createVariable('lat', 'f4', ('lat',))
    nc['lat'][:] = 1
with Dataset("test.nc", "a") as nc:
    nc['lat'][:] = 2

@jhamman
Member Author

jhamman commented Oct 18, 2017

@fmaussion - I've updated the append logic slightly. I'm wondering what you think? This version more aggressively overwrites existing variables (data_vars and coords).

@fmaussion
Member

Thanks! I like it: simple and in accordance with netCDF4's silent overwriting. It would be cool to describe this behavior in the documentation somewhere, and maybe add a test that the data is correctly overwritten?
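A minimal sketch of such an overwrite test, assuming one of xarray's netCDF engines is installed (names and paths are illustrative, not the merged test code):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# hypothetical temporary file for the sketch
path = os.path.join(tempfile.mkdtemp(), 'overwrite.nc')

# write an initial dataset, then append a dataset reusing the same name
xr.Dataset({'var1': ('x', [1.0, 2.0, 3.0])}).to_netcdf(path, mode='w')
xr.Dataset({'var1': ('x', [4.0, 5.0, 6.0])}).to_netcdf(path, mode='a')

# with the append behavior discussed here, 'var1' is silently overwritten
with xr.open_dataset(path) as actual:
    values = actual['var1'].values.copy()
```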

@jhamman
Member Author

jhamman commented Oct 21, 2017

Question for those who are familiar with the scipy backend. I have a few failing tests here on the scipy backend and I'm not really sure what's going on. It seems like it could be related to the mmap feature in scipy.

@shoyer
Member

shoyer commented Oct 21, 2017

@jhamman I think this is something to do with string/bytes. The data gets written as Unicode strings but then read back in as bytes, which is obviously not ideal.

If you want to ignore this, like the current tests, you can use assert_allclose() which has a flag for ignoring string/bytes issues (yeah, I know it's a nasty hack). In the long run we need to figure out how to solve issues like #1638. It seems like Unidata/netcdf-c#402 is likely relevant.

doc/io.rst Outdated
@@ -176,6 +176,10 @@ for dealing with datasets too big to fit into memory. Instead, xarray integrates
with dask.array (see :ref:`dask`), which provides a fully featured engine for
streaming computation.

It is possible to append or overwrite netCDF variables using the ``mode='a'``
argument. When using this option, all variables in the dataset will be written
to the netCDF file, regardless if they exist in the original dataset.
Member

for clarity: in the original file
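The appending behavior described in this doc hunk could be illustrated with a short sketch like the following (variable names and path are hypothetical):

```python
import os
import tempfile

import xarray as xr

# hypothetical temporary file for the sketch
path = os.path.join(tempfile.mkdtemp(), 'append.nc')

# write one variable, then append a second one to the same file
xr.Dataset({'a': ('x', [1.0, 2.0])}).to_netcdf(path, mode='w')
xr.Dataset({'b': ('x', [3.0, 4.0])}).to_netcdf(path, mode='a')

# both variables now live in the file
with xr.open_dataset(path) as ds:
    names = sorted(ds.data_vars)
```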

@@ -963,7 +963,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
default format becomes NETCDF3_64BIT).
mode : {'w', 'a'}, optional
Write ('w') or append ('a') mode. If mode='w', any existing file at
this location will be overwritten.
this location will be overwritten. If mode='a', exisitng variables
will be over written.
Member

overwritten in one word?

data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a',
engine='scipy')
actual = open_dataset(tmp_file, engine='scipy')
Member

This is triggering a test failure on Windows, where you apparently can't open the same file twice.

data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a',
engine='netcdf4')
actual = open_dataset(tmp_file, autoclose=self.autoclose,
Member

Rather than allowing a clean-up failure, please close this file if possible (use a context manager). File descriptors are still a limited resource for our test suite.

@@ -823,6 +823,37 @@ def roundtrip(self, data, save_kwargs={}, open_kwargs={},
autoclose=self.autoclose, **open_kwargs) as ds:
yield ds

@contextlib.contextmanager
def roundtrip_append(self, data, save_kwargs={}, open_kwargs={},
Member

Can we put most of these tests in one of the base classes, so we don't separately repeat them for scipy, netcdf4 and h5netcdf?

target, source = self.prepare_variable(
name, v, check, unlimited_dims=unlimited_dims)
if (vn not in self.variables or
(getattr(self, '_mode', False) != 'a')):
Member

I don't think we need the explicit check for the mode here. With mode='w', the file should already be starting from scratch.

mode = 'a' if i > 0 else 'w'
data[[key]].to_netcdf(tmp_file, mode=mode, **save_kwargs)
with open_dataset(tmp_file,
autoclose=self.autoclose, **open_kwargs) as ds:
Member Author

@shoyer - I've put these tests in a base class now. I've never quite understood how these tests work though. Are the inherited classes passing open/save_kwargs that specify the engine? How does this work?

Member

Yes, that's exactly right.

One day it would be nice to split apart the gigantic test_backends.py into separate files for each backend...

data['var9'] = data['var2'] * 3
data[['var2', 'var9']].to_netcdf(tmp_file, mode='a')
with open_dataset(tmp_file, autoclose=self.autoclose) as actual:
assert_identical(data, actual)
Member Author

Regarding the issue with Windows: what's the best way to do this then? Is it possible to explicitly close a netCDF file written with to_netcdf?

Member

to_netcdf() should write and close the file. You only need to use the context manager to ensure things get closed with open_dataset().
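In other words (a hedged sketch; the path and dataset names are illustrative):

```python
import os
import tempfile

import xarray as xr

# hypothetical temporary file for the sketch
path = os.path.join(tempfile.mkdtemp(), 'close.nc')

# to_netcdf() writes and closes the file before returning,
# so a follow-up append does not collide with a dangling handle
xr.Dataset({'var1': ('x', [1.0, 2.0])}).to_netcdf(path, mode='w')
xr.Dataset({'var2': ('x', [3.0, 4.0])}).to_netcdf(path, mode='a')

# only reads need explicit closing; the context manager handles it,
# which also avoids the double-open problem on Windows
with xr.open_dataset(path) as actual:
    n_vars = len(actual.data_vars)
```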

@jhamman
Member Author

jhamman commented Oct 24, 2017

@shoyer - ready for final review.

@@ -974,7 +974,8 @@ def to_netcdf(self, path=None, mode='w', format=None, group=None,
default format becomes NETCDF3_64BIT).
mode : {'w', 'a'}, optional
Write ('w') or append ('a') mode. If mode='w', any existing file at
this location will be overwritten.
this location will be overwritten. If mode='a', exisitng variables
Member

exisitng -> existing

allow_cleanup_failure=allow_cleanup_failure) as tmp_file:
for i, key in enumerate(data.variables):
mode = 'a' if i > 0 else 'w'
data[[key]].to_netcdf(tmp_file, mode=mode, **save_kwargs)
Member

This is currently always using the default backend. You need to explicitly set engine here somewhere, though much of the logic can potentially live on the base class or in a helper function of some sort. There's no magic that sets this automatically for subclasses.

Take a look at how roundtrip() looks. I think I led you astray by suggesting that you move everything to the base class.

@@ -592,6 +627,8 @@ def create_tmp_files(nfiles, suffix='.nc', allow_cleanup_failure=False):

@requires_netCDF4
class BaseNetCDF4Test(CFEncodedDataTest):
engine = 'netcdf4'

Member Author

@shoyer, what do you think about this solution?

@@ -1024,6 +1030,8 @@ class ScipyFilePathTestAutocloseTrue(ScipyFilePathTest):

@requires_netCDF4
class NetCDF3ViaNetCDF4DataTest(CFEncodedDataTest, Only32BitTypes, TestCase):
engine = 'netcdf4'
Member

This isn't quite enough -- you also need to set the format in this case. Take a look at roundtrip() below.

Probably the cleanest fix would be refactor roundtrip() into three methods:

     # on the base class
     @contextlib.contextmanager
     def roundtrip(self, data, save_kwargs={}, open_kwargs={},
                   allow_cleanup_failure=False):
         with create_tmp_file(
                 allow_cleanup_failure=allow_cleanup_failure) as path:
             self.save(data, path, **save_kwargs)
             with self.open(path, **open_kwargs) as ds:
                 yield ds

     # on subclasses, e.g., for NetCDF3ViaNetCDF4DataTest
     def save(self, dataset, path, **kwargs):
         dataset.to_netcdf(path, format='NETCDF3_CLASSIC',
                           engine='netcdf4', **kwargs)

     @contextlib.contextmanager
     def open(self, path, **kwargs):
         with open_dataset(path, engine='netcdf4',
                           autoclose=self.autoclose, **kwargs) as ds:
             yield ds

Then you could write roundtrip_append() in terms of save and open.
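One way roundtrip_append() could then look, written here as a standalone sketch with module-level stand-ins for the proposed save()/open() methods (names are hypothetical, not the merged code; a subclass would pass its engine/format through these helpers):

```python
import contextlib
import os
import tempfile

import xarray as xr

# module-level stand-ins for the proposed save()/open() methods;
# a real subclass would fix engine/format here
def save(dataset, path, **kwargs):
    dataset.to_netcdf(path, **kwargs)

@contextlib.contextmanager
def open_(path, **kwargs):
    with xr.open_dataset(path, **kwargs) as ds:
        yield ds

@contextlib.contextmanager
def roundtrip_append(data, save_kwargs={}, open_kwargs={}):
    path = os.path.join(tempfile.mkdtemp(), 'tmp.nc')
    # write the first variable with mode='w', append the rest with mode='a'
    for i, key in enumerate(data.variables):
        mode = 'a' if i > 0 else 'w'
        save(data[[key]], path, mode=mode, **save_kwargs)
    with open_(path, **open_kwargs) as ds:
        yield ds
```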

@jhamman
Member Author

jhamman commented Oct 25, 2017

@shoyer - take another look. I have basically merged our two ideas and refactored the roundtrip tests. Tests are still failing, though not for py2.7, on AppVeyor, or for py3.6 locally.

@shoyer
Member

shoyer commented Oct 25, 2017

@jhamman This looks great, thank you!
